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Abstract. During the MaxEnt 2002 workshop in Moscow, Idaho, Tony Vignaux asked again a 
few simple questions about using Maximum Entropy or Bayesian approaches for the famous Dice 
problems which have been analyzed many times through this workshop and also in other places. 
Here, there is another analysis of these problems. I hope that, this paper will answer a few questions 
of Tony and other participants of the workshop on the situations where we can use Maximum 
Entropy or Bayesian approaches or even the cases where we can actually use both of them. 
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INTRODUCTION 

Dice problems have been analyzed many times (See mainly Ed. Jaynes papers [1, 2, 3, 4] 
and also [5, 6, 7, 8]), but it seems that still many questions are open. In this note, I will try 
to answer some of them. Before starting, we need to set up precise notation and describe 
precisely the context. 

Let's consider an imaginary die with K faces (K = 6 is the ordinary die) where on 
each face there is a number. We note these numbers g = [gi,... ,gK']- K is the number 
of elementary states and commonly, K' = K and = k, but we may also consider the 
cases where gk are any other numbers (integer or real) distinct or not. 

Let's also represent by X the variable corresponding to face number and by G the 
variable corresponding to the number written on the faces. So, X may take values 
{1,...,K} and G can take values {gi, . . . ,gx>}- Then, we can define P(X = k) and 
P(G = gu). If the g% are distinct numbers, i.e.,, K = K' , they are equal P(X = k) = 
P(G = g k ) = k , but note that E{X} = k6 k ^ E{C7} = Y^ k 9kQk- If gk is a monotone 
function of k, then it is easy to relate E{X} to E{C7}, but it may not always be the case. 

Note also that, in many dice problems, the main hypothesis is that they are fair. 
Then assigning the probability distributions becomes a combinatorial computation. For 
example, suppose we throw two dice and count the sums S of the two faces numbers. 
We want to assign the probabilities pj = P(S = Sj). 

First, we assume g k = k and note that S can take the values in the set Q = {2, 3, . . . , 12} 
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and \Q\ = 11. We must be careful here because the event S = Sj can occur q(sj) = 
6 — 1 7 — sj | times . For example, S = 2 occurs one time £^ = {(1,1)}, but 5 = 5 occurs 5 
times Ej — {(1,4), (2,3), (3,2), (4, 1)}. Now, using the basic principle of equal weight of 
statistical mechanics or insufficient reason of Laplace, we assign pj = P(S — Sj) oc \Ej\ 

which gives pj = P(S = Sj) = <?(sj)/El=U( s j)- 

In a more general case, we may have L dice and may want to define the events such 
that Ej = {(X 1 = x 1 ,...,X L = x L )} or Ej = {(X 1 = x 1: . . . ,X L = x L ) : Y.i x i = s j} 
and assign them probabilities. We may also consider the case where we throw L dice 
simultaneously iV times which is not the same as throwing N dice simultaneously L 
times, except the case where the dice are identical. We may also consider the cases 
where the number of throwing the dice are different, i.e.,, the dice I has been thrown iVj 
times. 

In some other analysis, we may not know if the die is loaded or not. This may be one 
of the questions to be answered. To be able to answer to a question, we may need to 
gather relevant data. These data may be of different form and thus, as we will see in the 
following, the way to use them to answer a question may also differ. 

Before gathering any data, we may define the question to be answered. For example, if 
we want to know if the die is loaded or not, we may be interested to infer about 0. Also, 
before gathering any data, we may make hypotheses and we may be able to translate the 
knowledge contained in these hypotheses by an a priori probability law n(0). 

For example, we may assume that the die is not loaded and assume 9i — 9 2 — ■ ■ ■ — 
9k — 1/K or choose a uniform prior for n(0) over the set {0 : Ok € [0, l]&Efc^fc = !}• 
Note that, even if they translate to a common- sounding hypothesis, mathematically 
speaking, they are not exactly the same. The former says P(J2k @k 7^ 1) = and P(9k ^ 
9 h k ^ l) = and P(a < 9 k < b) = (b - a), VI > b > a > 0. 

We may also be able to associate a likelihood function P(D\0) with the data to 
represent the amount of knowledge about the unknown parameters contained in the data. 
We will see however that this may not be easy in some cases. 

The questions may also be different: We may want to know if the die is loaded or not 
or we may want to know what is the probability that the next face be the face k, or still, 
what are the numbers written on the faces of the die. 

Let us start by a simple and easy problem which, here after, we call Problem 1. 



PROBLEM 1 

We have observed the complete data x = [xi,...,xn] and we know the number of states 
K (number of faces). The question is to estimate = [9 1: . . . , 9 K \ where 9 k = P(X = k) 
is the probability of the event face k up. 

Here is a Matlab program which simulates this data generation: 

K=6;N=100; x=round ( (K-l) *rand (N, 1) ) +1; 

and the following is an example (an N sample) of this data set: 



x = [4,2,2,2,1,5,4,5,1,4,3,3,6,6,4,6,6,4,4,1,1,2,1,6,4,2,4,2,3,2,2,6,2,2,1,6, 

5,5,6,3,5,4,2,2,4,4,4,3,6,6,4,5,2,5,3,5,2,5,1,3,3,4,3,1,3,3,5,3,3,2,5,5,3,4,4, 

3,3,3,4,1,2,4,4,5,4,5,6,5,6,5,5,5,1,1,4,1,5,2,1,6]. 



Note that, if we re-run the program, we obtain a different data set. Here are the 
results of a second run: 



x = [6,2,4,3,5,5,3,1,5,3,4,5,6,5,2,3,6,6,3,5,1,3,5,1,2,2,2,4,2,2,1,5,3,6,3,3, 

5,4,2,4,5,1,4,3,5,4,5,3,3,2,2,4,3,4,2,4,3,5,5,4,3,5,5,4,5,4,3,2,3,4,5,3,5,4,3, 

5,4,3,4,4,5,6,4,5,2,6,2,2,5,5,2,1,5,2,2,4,2,3,1,6]. 



These two data sets can represent two different experiences using the same die. 

Let n k denote the number of times the face k has shown up n k = #(X = k). Then we 
have J2k n k = N. Here is a Matlab program which computes these numbers: 

nk=zeros (K, 1 ) ; 
for k=l:K 

nk (k) =sum (x==k) ; 

end 

and here are the results for the two above data sets: 

Data set 1: n = [13, 17, 17,21, 19, 13] and 
Data set 2: n = [07, 19,21,20,25,08]. 

Now, let's start by asking about the values of 9 k . If all the 9 k are the same value, we 
can say that the die is not loaded, but if they are too different from each other, we may 
say that the die is loaded. 

A wise man can say: This is an easy problem. If each trial has been done identically 
and independently, then it is reasonable to estimate each 9 k by 9 k = n k /N and no need 
for more complex mathematics. But if we ask: How confident or (how sure) are you 
about these values? He may say: hum..., let's use the probability theory. 

Assume we know K and we have given x (and thus we now N) and assume that the 
die has been thrown always in the same manner and independently. Here then, we can 
write the complete likelihood function 



Note that in the right hand side of this expression, x is present through n k and we can 
write P(x\0) = P(n\0). 

Then the likelihood C(6) = P(x\6) and we have 



p(x\e) = l[c^^(i-e k ) 



N-n k 



(1) 



k 



ln£(0) = W ^9 k + ( N ~ n k ) Ml - Ok)} + c(n k , N) 



(2) 



k 



where c(n k ,N) = J2 k InCjv does not depend on 6. 

Knowing that each parameter 9 k E [0,1], we can choose a uniform prior n(9 k ) = 1 on 
this interval. However, we know that J2 k #fc = 1> then we can define the set = {0 : 9 k E 
[0, l]&X^fe ®k = 1} and thus define a uniform prior on this set ir(0) = 1, V0 E and zero 
elsewhere, and thus obtain the a posteriori law 

n(0\x) = m . n{e) =^U C n C (! - O) 
m(£c) ma; 11 

k 

which is defined on the same set and where m(cc) is the marginal or evidence function: 

m(x) = [ £(O)n(0)dO= [ dOT\C% 9 n k k (1 - 9 k ) N ~ n K (4) 
Je Je k 

Thus, we have 

ln7r(e|au) = ^ [n k ln9 k + (N-n k )ln(l - 9 k )\ - \nm(x). (5) 

k 

^MAP 

Now, if we are only interested by the value of 6 which has the highest probability, 
we can compute it by putting the derivative of \mr(6\x) with respect to each parameter 
9 k to obtain 

dlnn(O\x)/d9 k = -- T ^ = w -^ ) =0^9 k = -. (6) 

There is only one possible solution to this equation and there is not any ambiguity. Here 
are the results for the two above data sets: 

^MAP 

Datasetl: 6 =[0.1300,0.1700,0.1700,0.2100,0.1900,0.1300] and 

^MAP 

Dataset2: 6 =[0.0700,0.1900,0.2100,0.2000,0.2500,0.0800]. 

But, we must be careful here on the interpretations that we can give to these numerical 
values. We may want to answer the following questions: 

• Do these two data sets come from the same die? 

• Is this die loaded? 

• What is the probability of seeing face k up based on the data set 1 or the data set 2? 

• If I throw this die 100 times again, what will be the number of times I will see face 
k up? 

We have still too much to do before being able to give correct answers these questions. 



PROBLEM 2 



Assume now that, in place of x, we have only access to the data n = (n 1 ,.. .,n K ) and 
know the values of K and N (or if we knew that J2k n k = N). It is easy to see that 



we obtain exactly the same result, because (n,K,^2^ =1 n k = N) define perfectly the 
likelihood and form sufficient statistics about this problem. 

Note however that, in both cases, the likelihood C{6) is not defined for 9 k = and 
9k — 1 and consequently, the posteriorpdf 7r(0|n) may not be a proper pdf. We are going 
to analyze properly this point. 

First, noting that the likelihood function in the previous section C(6) = Y[ k K®k) and 
7r(0) = Ylk^i^k), we also have n(0\x) = Y[k n (^k\x) = Y[k n (^k\n k ). Thus, we can 
work hereafter only with the functions 1(9), tt(9), tt(9\x) and m(x) which is given by 
[9,10] 

m (x)= [ n(9)9 x {l-9Y N - x) d9. (7) 



'0 



With a uniform prior ir(9) = 1 we have 

m(x)= [ 9 x (l-9) {N ~ x) d9 = B(x + l,N-x + l) (8) 
Jo 

where B(a ) f3) is the Beta probability density function (pdf) 

f(x\aj) = — i— x"- 1 (1 - x) ^ (9) 

which is defined for a > 0, (3 > and x G [0, 1] and where 

B(aJ)= [ x^il-x^-^dx (10) 
Jo 

and we have: 

mode{x} = — a - , E{x} = — ^— and Var{x} 



a + /3-2' 1 J a + L J (a + /3) 2 (a + /3 + l)' 

Consequently, the posterior law, whose expression is n(6\n) = B(n, N — n) or equiv- 
alently 7r(^|n fc ) = B(nk,N — n^), is only bounded if iV > > 0. The MAP estimators 
9MAP = n^i do n()t exist jf nfc < 1 or if nfc > jv - 1 and if N < 2. 

The posterior mean estimators 9j? M = ^ exist if N > and the posterior variances 
Var{#} fc = exist if < n k < N and > 0. Note also that when n k = 1 the 

corresponding MAP estimator is 9 k = and when n k = N — 1 the corresponding MAP 
estimator is 9 k = 1. This shows a kind of bias of the estimator toward 9 k = and 9 k = 1 
(See Table 1). 

One may want to have a proper posterior law 7r(0|n) for the whole range of possible 
values of the parameters 9 k e [0, 1] and the data n k — [0, 1, . . . , N]. This can be done via 
other choices for the prior law. In the two previous cases, we choose a uniform a priori 
for 9 k . Some authors argued that this choice is too biased against extreme values and 
1 and proposed to use 



7i(9 k ) = [9 k (l-9 k )r 1 = 9 k 1 (l-9 k )- 1 . 



(11) 



Note also that, again with this prior, the normalization factor or the evidence function 
m(x) is given by 

m(x)= [ [9(l-6)]- 1 9 x (l-9Y N ^d9 = B(x + l,N-x+l) (12) 
Jo 

which yields n(0\n) = B(n + 1, N — n + 1) which is bounded if N — 1 > n k > (See 
Table 1). 

A more general choice is 

n(9 k )=9 a k -\l-9 k ) b - 1 (13) 

which results to 

m(x)= [ 9 a -\l-9) b - 1 9 x (l-9Y N - x ^d9 = B(x + a,N + b-x) (14) 
Jo 

which result to n(0\n) = B(n + a,N + b — n) which is bounded if N — b > n k > 1 — a. 
Then, the mean values 9 k = (n k + a)/(N + b + a) have the limit value 9 k = n k /N when 
a — b i— > 0. The following Table summarizes these points. 



TABLE 1. Different a posteriori laws corresponding to different choices of a priori laws 



a>0 


(3>0 


a + (3 


mode 


mean 


variance 




7r(0fc) = l. Tr(0 k \n k ) = B(n k ,N-n k ) 


n k 
n k >0 


N-n k 
n k <N 


N 
N>0 


nk — 1 


N 

n k > 0, 
N>0 


n k (N-n k ) 


JV-2 

n k > 0, 
N>2 


N'^(N-l) 

n k > 0, 
n k < N, 
N>1 


ir{6 k ) = 6 k - 1 {l-6)-\ ir(9 k \n k )=B(n k + l,N-n k ) 


n k + l 
n k >0 


N-n k 
n k <N 


N+l 

N>0 


N-l 

n k > 0, 
N> 1 


n k + l 


n k +l)(N-n k ) 


N+l 

n k > 0, 
N>0 


(N+l)^(N+2) 

n k > 0, 
n k <N, 
N>0 


<0k) = e a k -\l - 9) b '\ TT(6 k \n k ) = B{n k + a,N + b-n k ) 


n k + a 
n k >0 


N-n k + b 
n k <N 


N + a + b 
N>0 




rik+a 


(n k +a)(N-n k +b) 


N+a+b-2 

n k > 0, 
iV>0 


N+a+b 

n k > 0, 
N>0 


(N+a+b)^(N+a+b+l) 

n k > 0, 
n k <N, 
N>0 



Note also that, when we have the expression of the a posteriori law ir(&\n), we may 
define other estimators than the MAP or the posterior mean (PM). We may also answer 
the questions of type P(a <9 k <b). 

Note however that, all these computed numbers depend on the data and our prior 
knowledge we included. For any other data set we obtain other numbers. One may want 



to study the sensitivity of the solution to a kind of variability of data. This can be done 
by Monte Carlo simulations or by repeating the experience (but very often this may not 
be possible). 

Also, in general the sample size or, more precisely, the contrast between the sample 
size and the number of parameters, is a crucial parameter. One may want to know the 
convergence of the solution to the hypothetical case where the sample size goes to 
infinity. 

Now, let's see if we can answer some of the questions at the end of the last section. 

• What is the probability of seeing face k up based on the data set 1 or the data set 2? 
For each data set, we can compute, for example, the following quantities: 

- The most probable values 6^f AP of 9 k ; 

- The mean values 9 k MP of 9 k \ 

- The variance values v k of 9 k ; 

- The lower values a k and upper values b k for which the probabilities P(a k < 

e k <b k ) = o.9. 

• Do these two data sets come from the same die? 

We can try to answer this question by comparing the probability laws iTi(9 k \xi), 
7r 2 (^fc|a3 2 ) and7r(0fc|a;i, x 2 ). But how to do this comparison? We may try to compute 
the relative entropy 

XL(7ri7r 2 ;7r) = / 7Ti(0 fc Xi) 7r 2 (0 fc x 2 ) In r d6 k . (15) 

J ir(9 k \x 1 ,x 2 ) 

If this value is near to zero, this means that the two data sets comes from different 
dice. 

• Is this die loaded? 

We can answer this question by computing the probabilities of two hypotheses 
H l = (9 l = 9 2 = ...= 9 K ) and H = (9 k ^ 9{), i.e., P^x) and P(H \x). 

P(Hi\x) 

P(H \x) 

• If I throw this die N' = 100 times again, what will be the number of times I will 
see the face k up? 

To answer this question, there are two methods: 

i) Use the data set x = {n, N, K} to compute ir(0\x) and estimate by one of the 
previous methods (MAP, PM, ...) and then compute P(n'\6,N' = 100, K). 



II / d97i(9 k = 9\ Xl ) 

k 



(16) 
(17) 



ii) Try to find the expression of P(n'\n, N, K, N') by following 

p(n\e,N,K) = Y[c%e?(i-e k ) N - n '<, 

k 

P(n'\0,N>,K) = Y[C$9<(l-9 k ) N '-<, 

k 

P(n,ri\0,N,K,N') = \[Cl k ^e n k k+n ' k {l-O k ) N+N '- nk - n ' k , 

k 

P(n'\n,0,N,K,N') = P(n,n f \6,N,K,N')/P(n'\6,N',K) 
and then integrate out to obtain P(n'\n, N, K, N'). 

PROBLEM 3 

Now, consider the case where, the observer has given to us only a subset (rii, . . . ,tik>) 
of the whole data n = (ni, . . .,n K ) with K' < K. (He just has forgotten to count and 
report the numbers {n k) k — K' + 1, . . . , K}, but he is sure that the die has K faces. In 
this case we can only obtain an expression for the likelihood function if we know the 

total number of the observations N = J2k=i n k> N' = J2k=i n k which is 

K' 

p(x\e)(xl[e^(i-e k ) N - nk . (18) 

k=l 

Note that this likelihood expression does not depend on the parameters 
{9 k , k — K' + 1, . . . , K}. Thus, the maximum likelihood (ML) estimation approach is 
unable to propose any values for them, while the Bayesian approach and in particular 
the MAP estimation can propose a solution which depends on the choice of a priori. For 
example, with a uniform prior, we have: 

f ^ k = l,...,K' 

k = \ \n-n>) h _ K ,,, K (19) 

where the first row is common with ML and the second row is due to the uniform prior 
and the normalization. 

It is important to note that, while in the two previous cases, the prior law it (6) has a 
less important role, here the classical ML approach cannot give any answer the problem 
and the role of prior information is crucial. 



PROBLEM 4 



Another interesting case is the one where we do not know the number of states (faces of 
the die). For example, we have observed the following data: 



a: = [4,2,2,2, 1,*,4,*,1,4,3,3,*,*,4,*,*,4, 4,1,1,2, 1,*,4,2, 4,2,3,2,2, *,2,2,1, 

*,*,*,*,3,*,4,2,2,4,4,4,3,*,*,4,*,2,*,3,*,2,*,1,3,3,4,3,1,3,3,*,3,3,2, 

*,*,3,4,4,3,3,3,4,1,2,4,4,*,4,*,*,*,*,*,*,*,1,1,4,1,*,2,1,*] 



where * may mean anything else greater than 4 or do not know. Note that these 
two cases are different. In the following, we first consider the first case which is, in fact, 
very close to the Problem 3 in the previous section, because we know exactly the n k for 
k — 1, . . . , K' but we do not know other n k , k > K' nor the the true value of K > K' 
itself. However, N is given. We can only give an expression for the likelihood if we fix 
the value of K. Then, we can consider K = 5,6,7,... and for each case compute the 
results using (19): 

For K = 5 we obtain: = [0.1300, 0.1700, 0.1700, 0.2100, 0.03200] 

For K = 6 we obtain: 9 = [0.1300, 0.1700, 0.1700, 0.2100, 0.01600, 0.01600] 

For K = 7 we obtain: 6 = [0.1300, 0.1700, 0.1700, 0.2100, 0.01067, , 0.01067, 0.01067] 

and so on. 



A difficult question remains: How to fix Kl We may try to compare n(0\x,K) for 
different values of K through their entropies. We may also choose a prior for it and 
compute 7r(0, K\x) or still integrate out 6 to obtain tt(K\x) from which we can estimate 
K. 

The case where, the * in the data means do not know is more complex. If at least we 
know K, then it may still be possible to write the expression of the likelihood. Let's note 
the true values of n k by Nu k . Then, we know that Nv k =G [n k ,n k + k = 1,. . . ,K' 

and Nv k =G [0,n*], k = K', ...,K with n* = iV — J2k=i n k- Then, we may write 



K' K 

p( X \e,u,K,K') = n^cii-^^-^n^cMi-^)^-^ 

fc=i fc=i 

or 

K 

P(x\0,u,K) = Y[C% Uk 6^(l-6 k ) n *- N »K 

k=i 

We can then try to integrate out from this expression to obtain P(x\u, K) or integrate 
out v to obtain P(x\9, K). But, what to do if we do not know Kl Can we also integrate 
out K by summing over all values of Kl 

Another question that may arise in this problem and the previous ones, is to estimate 
the frequencies v k = n k /N which is not exactly the same question of estimating 9 k . In 
the following, we consider this problem. 

First consider the case of complete data {n, N, K} of problems 1 and 2. We may note 
that, if we assume that the die is fair, the knowledge of the past experience ({n, N, K}) 
does not change anything on the results of the future experience. But, if we do not know 



if the die is loaded, then from the past experience, we can estimate 6 and use it to 
compute the probability of observing any event. 

The situation becomes more complex if we do not know K or N or if some data are 
missing as is the case in problems 3 or 4, or more generally the cases where we cannot 
write easily the exact expression of the likelihood. 

Consider the incomplete data problem 4 where we know N, n k and n*, but we do 
not know K and assume that the * are a priori distributed uniformly between 1 and K 
(or between K' and K) and compute the numbers d k = (n k + n*/K)/N (or d k = (n k + 
n*/ (K — K'))/N). We can then say that these computed d k are good approximations to 
the true unobserved v k . The question is how to model this approximation. Two models 
can then be used: 

i) Assume d k as the mean values of the unknown frequencies v k 

d k = E{u k } = J v k p(v k )dv k (20) 

or 

ii) Assume each d k to be the sum of the true v k and a random error e k : 

d k = v k + e k (21) 

where e k is assumed to be centered with unknown pdf. In both cases, we are 
interested in finding p(v k \d k ) or p{y\d). 

But, before going further, it is important to note that, in the following, we are not going 
to analyze the original data x but the pre-processed data d. We changed the problem to 
a new one: Given d can we assign or compute p{v\d). 
Two approaches can then be used. 

Information Theory or Maximum Entropy approach: 

This approach is based on the first equation between d k and v k . It is obvious that, there 
are an infinite number of possible solutions to this equation. Let us denote by V this 
ensemble: 

V = {p : E{z/ fc } = J v k p(v k )dv k = d k }. (22) 

The Maximum Entropy principle chooses the one p ME {y k ) with the highest entropy 

p ME (u k ) = Mgm^{H(p)} (23) 

where 

H{p) = — J p(x)\np(x)dx, (24) 

or, more generally, if we assume a reference (prior?) distribution q(u k ), the one 
p MKL (vk) which has minimum Cross Entropy or Kullback-Leibler (KL) divergence 
[7, 11, 12], of p with respect to to q: 

p MKL (»k) = argmm{KL(p,q)} (25) 



where 

KL(p,q) = J p(x)ln(p(x)/q(x))dx. (26) 

We note that when q is uniform KL(p, q) = —H(p) and thus p MKL (u k ) = p ME (vk)- 
The unique solution, if exists, is given by 

p MKL (u k ) = -}— q (v k )e W {-\ k v k } (27) 

where 

z (h) = J q{y k ) exp {-A fc z/ fe } du k , (28) 
and it can be shown that \ k is the solution of the equation 

-d\nZ(X k )/d\ k = d k (29) 

which can be computed numerically. It is evident that the expressions of p MKL {y k ), 
Z(X k ), and consequently any numerical values for the estimate 

vf KL = EK} = J v k p MKL {v k ) dv k (30) 

depend on the choice of q. 

As a matter of algorithmic and computation of A (solution of the equation (29)) and 
v defined in (30), it is interesting to know that they can be computed through: 

[A =argmin A { J D(A) = lnZ(A) + A'd}, 
\v = arguing {H(u,u^)} 

where D{\) is called the dual criterion and H(is, u^) is called the primal criterion and 

where vf ] = E q {u k } = J v k q{y k )dv k . 

The expressions of dual and primal criteria also depends on the expression of q. For 
example, when q is uniform on C, p is exponential we have 

Z{\) = n(lAfc), ^Z(\) = -^hiA fc , D(X) = _^lnA fc + ^A fc 4 

k k k k 

and _ 

P(^^)) = -^lnK/,f) + ^K-,f). 

k k 

For other choices of q and more details on these relations refer to [13, 14, 15, 16, 17, 18, 
19,20,21,22]. 

Bayesian approach: 

The Bayesian approach is based on the second equation, i.e.,, d k = v k + e k and we have 
to find an expression for the likelihood C{v) = P{d\v) and assign a prior q(u k ) or 



q(v). When this done we can give an expression for the posterior n B {y\d). Note that, 
in both cases, we have to choose q(v). The first step, which is to find an expression for 
C{v) = P(d\u), is not easy. Here are a few approaches: 

Assuming 6 = v: 

The first approach consists in assuming 6 = v. Then, if we are also given N, the problem 
becomes equivalent to the Problem 2 and we have: 

P(d\u,N) = X[C* dk u* dk (l-u k ) N{l ~ d *\ (32) 

k 

Then, again choosing a uniform prior q(y) = — J2k u k)> we obtain 

n(u k \d,N) = B (Nd k - 1,N(1 - 4) - 1) (33) 

and then we have 

E{u k \d,N} = ^^. (34) 

We see that E{u k \d, N} i— > d k when A" goes to infinity. 

But, if we do not know N, we can try to integrate out N. Can we do it easily? I did 
not go further in this direction. 

Frequentist point of view: 

Here, we assume a priori that the die is fair and try to obtain an expression for the 
likelihood C(d\u,N) using the following arguments: 

Given iV and K and assuming that each through of the die is independent of all others, 
we may argue on the number of possible outcomes resulting to a particular data set using 
the multinomial coefficient 

TV' AH 

W(n,N, K) = — = (35) 

ni!...n fc ! nf=iK0 

W{n, N, K) is the number of possible outcomes x such that the face k appears n k times 
between the total possible outcomes which is K N . Thus, we may assign 



N\ 



P(n\N,K) = W(n,N,K)/(K N ) = ^ . (36) 

(^)nf=iK0 

It is known that, using the Stirling approximation 2 the expression of this probability, 
when is large, converges to 

K 

lim lnP(n\N,K) = H(v) = -y^v k lnv k (37) 

ATi — >oo ' J 

k=l 



2 Stirling (1692-1770) showed that x n = "+i /2 converges to \f2ivn when n goes to oo. This means 
that, for large n we get the approximation ln(n!) = |ln(2-7rn) +nlnn. However, even if this is usu- 
ally called Stirling's formula, in fact, it may have been known earlier to Abraham de Moivre (see 

http : / / www-gap .dcs.st-and.ac. uk/history/Mathematicians/De_Moivre . html). 



where u k = Hindoo ^. 

This explanation and this approximation have also been used to justify the choice 
of an expression for entropy H{u) = — J2 k u k lau k and a prior law for nb which is 
7r(n) oc exp {aH(u)}, so that, given a set of constraints on u k , finding the most probable 
(sampling argument or maximum likelihood approach) value of n subject to those 
constraints become equivalent to maximizing H{v) subject to those constraints: 

n k = argmax{ln7r(n)} = arg max {H(n) } . (38) 

n n 

But, we do not know either N or K. We may however try to use these expressions to 
find approximations to the likelihood function we need. First, we may assign 

P(d\v,N,K) = P(Nd\N,K)(l-P(Nu\N,K)), (39) 

and replacing for P(Nd\N,K) and P(Nu\N,K) and using again the Stirling formula 
we may find an expression which may be independent of N. 

Integration of nuisance parameter 0: 

Again here, we start by assuming N known. Then, we know the expressions of 
P(d\e,N) andP{u\0,N): 

P(d\e,N) = Y[C% dk 6* dk (1 - 9 k ) N(1 - d ^ (40) 
k 

and 

P(v\0,N) = d u Vk (l-0 k ) N{1 ' Uk \ (41) 

k 

Then we can write 

P(d\u,0,N) = (l-P(u\6,N)) P(d\6,N) 



i-\\c^e^{i-e k ) N ^ 



k 

xY[c^ dk e^ dk (i-e k ) N ^- dk \ (42) 

k 

Then, we have to integrate out to obtain the likelihood C{v) = P{d\v,N). Can we 
obtain simple expressions? Can we integrate out iV too? I did not go farther in this 
direction. 

Ad hoc empirical approach: 

Another approach is to assign the two pdfs p(e) = p(d k — v k ) and the prior q(u k ) from 
which we can compute 

7r B K|d fe ) = p(d k - v k ) q(v k ) I m(d k ). (43) 



Here too, the expression of the posterior pdf ir(u k \d k ) and thus any inference about v k 
depends on the choice of p(e) and q(v k ). 

A question may arise here: 
Can we first fix q(v k ) and compute p MKL {v k ) and use it again as a prior in this Bayesian 
approach? 

The answer is "No", because p MKL {v k ) is in fact p MKL (i' k \d k ) and doing so, we have 
used two times the same data d k . 

Another question is how to compare and how to use p MKL {v k \d k ) and ir B (v k \ dfc)? 
My answer is that n B contains more information than that of p MKL 5 because to obtain 
7r B , we combined information about both e k through p(e) and v k through q(v k ) while to 
obtain p MKL we used only q{v k ). Indeed, it seems that the only consistent point estimator 
of v k from p MKL is its posterior mean, while, there is not any such restriction on n B . 



An important case is the one where we have only given the mean value of the face 
numbers J2 k k9 k = do or the more general case of the mean value of the numbers 
written on the faces ^2, k gkQk = d without any other knowledge and, in particular, without 
knowing N. We need however to know K. 

Remember also that E{X} = ^2 k k9 k and E{G} = J2 k gk@k are not the same. They 
become equivalent if g k — k. 

Thus, we consider the case: 



and we assume to know the number of states K. The objective is to find 9 k . 
MaxEnt solution: 

The classical answer this problem is MaxEnt which can be described as follows: 

It is obvious that, there are infinite number of possible solutions to the equation (44). 

The Maximum Entropy principle chooses the one with the highest entropy 



PROBLEM 5 




(44) 



k 




(45) 



k 



The solution has the form 



9 k (X) 



1 



exp{-Xg k } = exp{-(lnZ(\) + \g k )} , 



Z(X) 



(46) 



where 




(47) 



k 



and A is the solution of the following equation 



-d\nZ(\)/d\ = d 



(48) 



which can be computed numerically. 

It is also easy to show that the maximum value of the entropy is 

H max (d) = -Y^6 k \n6 k = lnZ(A) + \d = maxln# fc (A) (49) 

A 

k 

which can also be written 

max 6>(A) = exp {H max (6)} . (50) 

A 

Bayesian solution: 

If we knew N, we could write the expression of the likelihood P(D = d\0,N) with 
d = Y.k9kn k and J2 k n k = N: 

N 

P(D = d\0,N) = H P(n\0)5(N -J2n k )S(d-Y,9kn k ). (51) 

n k =0 k k 

We can also try to integrate out N: 

oo N 

P(D = d\6) = J2U P(n\e)5(N -J2n k )5(d-J29kn k ). (52) 

AT=0ra fe =0 k k 

These computations seem to me intractable. In the following, I propose another ap- 
proach: 

The main idea here is that, we may account for uncertainty of this data (in particular, 
because we do not know the value of N) by assuming 

p{d\e)^u[d-Y,g k 9 k ,a^j, (53) 

and by arguing on the additivity and positivity of we choose 

tt(0) = exp {-H(0)}. (54) 

Then, the posterior is 

ir(6\d) = exp | ~(d - 9k k ) 2 ~ H{0) J , (55) 



and the MAP solution is 



= arg mm j (d - ^ g k 9 k f - aH{0) J (56) 



with a = 2a 2 . 

Now, if we choose H{6) = ^ fc 9 k \n6 k the numerical results obtained by this approach 
and those obtained by using the MaxEnt solution become almost identical. However, if 
we can fix the value of a, we have access to the n(6\d) which contains more information 
than only one point estimator. 

Combined data fusion solution: 

Assume now that, not only we have the data x or n, but also d from previous section. 
How to combine them? Here is my solution. 

Follow the Bayesian approach of the sections 1 or 2 to write down the expression of 
the a posteriori law 

\nn{0\n) = ^[n k \n6 k + {N-n k )\n{l-6 k )]+\nit{0) + c (57) 

k 

and use the expression of n(6\d) in equation (55) as the prior it (6) here. 



PROBLEM 6 

Assume now that, our observer has repeated the experience L times, and before each 
experience, he has changed the numbers written on each face. For example, the first 
time, he has written g k = k and for the second experience g k = k 2 . This is also equivalent 
to the experiment of using L similar dice with different colors and different labeling on 
each faces simultaneously. Then, he computed the numbers n k i. 

But, assume now that, finally, he gives us only the mean values n { = (1/iV) J2k n ki or 
di — (1/N) J2 k 9ki- The problem is similar to the previous case, but here we have L data: 

^gkiO k = di, Z = 1,...,L, (58) 

k 

which can be written GO = d where G is the matrix with elements g k i. Thus, we have a 
linear system of equations with K unknowns and L data. Note that here we know exactly 
the values g k \. 

If the experimenter has made good choices for g k i and if L = K, then we may only 
try to solve that system of equations and obtain an exact solution to the problem. But, 
what if L < K or if the experimenter has not made a good choice for g kh for example, 
if he has naively written g ki = kl. In both cases, the system of equations has an infinite 
number of solutions. 

MaxEnt solution: 

The MaxEnt approach is again straightforward and the solution has the form 

6k = z(~\) exp I ~^2 Xl9kl r =ex P j -(^zW + ^^igki) >, (59) 



where 

Z(A) = ^exp j-^A^J, (60) 

and A = [Ai, . . . , A L ] is the solution of the following equation 

-d\nZ(X)/d\i = d l (61) 

which can be computed numerically. It is also easy to show that the maximum value of 
the entropy is 

H max (0) = -^2e k \n9 k = lnZ(A) + XU = maxln0(A) (62) 

k 

which can also be written 

max0(A)=exp{# ma:r (0)}. (63) 

A 

Bayesian solution: 

Following the steps of the section 5, we have 

p(d\0) =Af(d-G0,a 2 ) (64) 
and by arguing on the additivity and positivity of we choose 

tt(0) = exp{-H(6)}. (65) 

Then, the posterior is 

ir(0\d) = exp |-^L||d - G0\\ 2 - #(0) j (66) 
and the MAP solution is 

= argmin{||d-G0|| 2 -a#(0)} (67) 



with a = 2a 2 . 

Combined data fusion solution: 

Assume now that, not only we have the data x or n, but also d from the previous section. 
How to combine them. Here again we can follow the Bayesian approach of the sections 
1 or 2 to write down the expression of the a posteriori law 

hi7r(0|n) =^2[n k ln0 k + (N-n k )ki(l-9 k )}+kiir{O)+c (68) 

k 

and use the expression of 7r(0|d) in equation (66) as the prior 7r(0) here. 



PROBLEM 7 



Consider the same previous experiment, but this time, the experimenter is sure that all 
dice were absolutely identical and unloaded, but he has forgotten to note the numbers he 
has written on the dice faces. 

However, he has also noted the mean values (l/L)J2i 9ki = d k . Can we be of any help 
for him to find them? 

Thus, this time, 9 k = 1/ K,k = 1, . . . ,K and we have 

Y J g k iO k = {i/K)Y J gki = d h 1 = 1,. ..,L, (69) 

k k 

and also (l/L)J2i9ki = 4, k = l,...,K. 

The problem becomes an interesting one, we want to compute the elements of a 
matrix from its row and column sums. This mathematical problem arises in many other 
applications such as computed tomography where we want to recover the pixel values 
of an image from its horizontal and vertical projections. 

Except the case of K = L = 2, we have always less data than unknowns and the 
problem has an infinite number of solutions. Even in the case K = L = 2 where the 
number of unknowns and data are equal, the problem is still under-determined and has 
infinite number of solutions. 

We need to question our experimenter to see if he can remember of any other infor- 
mation about those numbers (prior information or constraints?) which can be helpful to 
give reasonable answers about this question. 

To go further in details of this problem, let's change slightly the notation. We want to 
estimate the elements g ki of a (K x L) matrix G from its row sums r k = g ki and its 
column sums q = J2k9ki- 

We may also note r = [r 1: . . . ,r K \, c= [c 1: . . . ,c L \, d= [r;c] and g a vector containing 
all the elements of the matrix G concatenated column by column. Then, it is easy to see 
that we can also write c = A x g, r = A 2 g and thus d = Ag where A u A 2 and A are, 

r Ao 

respectively, a (K x KL), a (L x KL) and a ((K + L) x KL) matrix with A = . 

and whose elements are composed of zeros and ones. 

Now, we consider two sets of answers of our experimenter: those who put determin- 
istic constraints on g ki and those who put probabilistic constraints. 

Deterministic constraints: 

• g k i = g k . Then, we have r k = Lg k and we have a unique solution g k = r k /L subject 
to the condition that Y. k 9k = f^k r k = c h l = l,---,L. 

• g k i = gi. Then, we have q = Kgi and we have a unique solution gi — ci/K subject 
to the condition that J2i9i = j?J2i c i = r k, k = 1,...,K. 

• 9ki = 9i k 92 l ■ Then, we have r k = g lk J2i 92 t and q = g 2l Y,k 9i k and we nave 9i k ^ r k 
and g 2l oc q. There still remains two unknowns J2i92i an( i Y2k9^k- However, if g lk 
and g 2l are normalized, then we have a unique solution. 



• g k i are normalized as they represent a probability distribution: ^2 k g k i = J2i9ki — 
YlikYliiQki — 1- This information is not enough to find a unique solution. That 
becomes true if g kl is separable as in the previous case. 

• g k i are normalized as they represent a probability distribution: ^ k g k i = J2i9ki — 
YlkYliQki — 1 an d and are distributed as uniformly as possible over the 

grid{(fc,z),fc = i,...,ir,z = i,...,L}. 

This information may be enough to find a solution if it exists, by maximizing 
H(g) = — Ylj9j ^ n 9j subject to the data constraint Ag = d and the normalization 
constraint YlijQj = 1- Then the solution is given by g = ■^^exp{A t X} where 

A is the solution of — d\nZ(X)/d\j = dj which can also be computed by A = 
argmin A { J D(A) = lnZ(A) + A*d}. 

Unfortunately, there is not an explicit expression for this solution, but it is by 
construction positive (gj oc exp{[A*A]j}) and satisfies the data and normalization 
constraints for any correct data sets. Note also that this solution is not a linear 
function of the data. 

There is only one question remaining: Is there any other criteria H(g) which can 
give these satisfactions? 

To give a partial answer to this question, we may say that any convex criterion 
can be used to find a unique solution. For example, H(g) = J2j9j = WdW 2 which 
gives the minimum norm (generalized inverse) solution g = A + d which becomes 
g = A t {AA t )^ l d if A A 1 was invertible. Note that this solution is a linear function 
of the data, but, this criterion does not guarantee the positivity of the solution. 

Another example is H{g) = Yl,j^ n 9i which gives the solution of the form gj = 
but, this criterion does not guarantee neither the positivity or the boundedness 
of the solution. One can find other convex criteria (see next section). 

Probabilistic constraints: 

• We know that g EC and that we generated g according to a reference measure q(g) 
over C such that E q {g} = g . Now, again, we can use the ME tool and search for 
p(g) such that AE p {g} = d and minimizes KL(p,q). We know that the solution 
is p(g) = -^q(g) exp ^X t A t gj where A is the solution of — d\nZ(X)/d\j = dj 

which can also be computed by A = argmin A (-D(A) = lnZ(A) + X t dj and finally, 
the solution g = E p {g} can be computed by g = axgmmA g =d{H (g , g } . However, 
as we discussed it before, the expression of H depends on the choice q(g): 
For C a closed set of real numbers and q(g) Gaussian, we have 

H(g,gW) = \\g-g \\ 2 . 
For for C a closed set of real numbers and q(g) a Lebesgue measure on C, we have 

H(g,g (0) ) = -Y,1n( Ss /go J ) + (g s -go J ), 

j 



and, finally, 

For C a closed set of integer numbers and q(g) Poissonian, we have 



H(g,g<®) = KL{g,g ) = J^Xft-/^-) + (<?, -g 0j ). 



j 



This discussion shows a relation between the classical ME approach of the last 
section and the ME in the mean as is presented here. Even if here, we have a tool to 
derive the expression of the needed convex criterion, still an arbitrary remains on 
the choice of C and the reference measure q(g). 

• Each element g ki has been generated independently using a Gaussian random 
number generator: g kl ~ Af(k, A). 

• Each element g k i has been generated independently using a Gaussian random 
number generator: g ki ~ A). 

• Two sets of numbers g lk and g 2l have been generated using a Gaussian random 
number generator gi k ~ Af(k,\i) and g 2l ~ A/"(/, A 2 ), then normalized and point- 
wise multiplied: g kl = gi k g 2r 

• Each element g k i has been generated independently using a random number gener- 
ator. We consider two interesting cases: g k i ~ A/"(/i, A) and g k i ~ "P(A). 

• The elements gu,g k i,giL,gKi have been generated independently using a random 
number generator A/"(0, 1), but others are generated by g k i ~ -Af(gki, 1) where g ki = 

\ [gk-1,1 + gk,i-i + gk+1,1 + gk,i+i] ■ 

Let's consider only the case of independent Gaussian g ki ~ jV(//, A) and g k i ~ V(\) 
where we may be able to do all the computations. 

Gaussian case: 



We have: 




Then, the column sums q and rows sums r k are also Gaussian: 




k 



and thus: 





Then, we can write the expression of the posterior law: 



p(g kl \r,c,X) oc P{g kh r,c\\) = exp |-^-(^ fc , -^) 2 | 



x exp • 



with r k = ^2g kl and Cl = ^g kl . 

i k 

It is then easily seen that 



p(g M \r,c,X) oc exp \ - — J(g kl ) 



1 x i 

with j( 9kl ) = (g kl -^ 2 + —j2(J2 9ki ~ L ^ 2+ iJ2 ( J2 9ki - K ^ 



k i 



is Gaussian and we can easily compute its mean and variance. To obtain the mean values, 
we can compute the derivative of 

J = (g kl - fi) 2 + -1 ^(r fc - Lfi) 2 + - J2( c i ~ K V? 



which is 



2 . 2 . 

dJ/dg M = 2( y g M -fi) + — ^2( y r k -Lfi) + -^2( y ci-Kfi) 



and equate it to zero to obtain 



9kl 



KL 



KL + 2L + 2K 



Cl 



(70) 



This result is interesting, because J2 k r k + \ J2i c i * s wnat * s called the back-projection 
in computed tomography. 



We can generalize these results, if we work with the vectors g, r = A x g, c = A 2 g 

2 



and d = [J = }g = Ag. Then, we have 
g~M{g Q ,R g ), d~ N^Ag^ARgA 1 ), 



■ M 



9o 
Ago 



R g R g A l 
AR g AR g A l 



and thus 

g\d~Af(g,R g ), with 



g = <7 + RgA^ARgAYid- Ag ) 

Rg = Rg~ RgA t (ARgA t ) + ARg 



where (Ai2 9 A*) + is the generalized inverse of AR g A l 

Note that when AR g A f is invertible, we have g 
For the particular case of R g = XI we have 



A^d and R g = 0. 



g = g + A t (AA t )+(d-Ag Q ) 
R g = \(I-A t (AA t ) + A) 



(71) 



A 1 
A 2 



we have 



~A 1 A\ 


A\A\ 




KI 1 " 


A 2 A\ 


A 2 A%_ 




1 LI 



For the particular case of A = 

AA l 



where 1 is a matrix with all its elements equal to 1 . We may note that A A 1 is singular 
and its rank is K + L — 1. We can however compute numerically g and R g . Note also 
that, even if a priori g k[ were independent, a posteriori they are correlated. 

Poisson case: gki^V(X): 
Here, we have: 

P(g k i) = A 9fci exp {-A} /(g kl \) — > \nP{g kl ) = {\nX)g kl - \n{g kl \) - X 

and 

gu~V(\), r k = J29ki-V{LX), Cl = Y,9ki~V(K\). 

i k 

Then, we can write 

P{r) = l[(L\y* exp {-LA} /(r k \), P(c) = H(KX) C ' exp{-KX} /(q!) 
k i 

and 

P(g kl \r,c,X) oc (A)«"/(<7«!) l[(LX)^ /(r k l) ]J(KXy< /( Cl l) 
with r fc = 5^0 fc , and c, = 

It is then possible to show that P(g k i\r, c, A) is also a Poisson law, but it is not easy to 
find an explicit expression for its mean value. However, using again the Striling formula 
when working with \nP(g k i\r, c, A) one can obtain an approximate expression for it 

P(9ki\ {gk'&,i'#}, r, c, A) = V(X(1 + Lexp {q} + Kexp {r k })), 

and thus we have 

^{9ki\{gk'^k,i'^i},r,c,X} = A(l + Lexp{Q} + Xexp{r fc }) 

= KLX(1/(KL) + (1/K) exp{c,} + (1/L) exp{r fc }). 

(72) 



This is interesting, because (l/K) exp{q} + (1/L) exp {r fc } corresponds again to the 
famous back-projection operation in computed tomography, but here, in place of back- 
projecting q and r k themselves, their exponential values exp {a} and exp {r k } are back- 
projected. 

CONCLUSIONS 

This paper was another analysis of dice problems trying to answer some of the ques- 
tions about the situations where we can use the Bayesian or the Maximum Entropy 
approaches. Through this paper, we distinguished three approaches: Bayesian, classical 
MaxEnt and MaxEnt on the mean. I showed some of the situations where we can use 
these approaches. 

The Bayesian approach can be used when we can write explicitly a probabilistic 
model relating the data to the unknown parameters from which we can deduce the 
expression of the likelihood and can assign an a priori law to those parameters, we 
can then use the Bayesian approach to compute the a posteriori from which we can 
infer about the parameters. 

The classical MaxEnt approach can be used in cases where we have a set of data which 
can be considered as linear constraints on a set of parameters which are themselves 
a probability distribution. Then the classical MaxEnt gives the possibility of finding a 
unique solution to the under-determined problem. 

The MaxEnt on the mean approach can be used in cases where we have a set of 
data which can be considered as linear constraints on the expected values of a set of 
parameters which are the elements of a convex set on which we can define a reference 
measure. Then, we can use the MaxEnt on the mean approach to compute a probability 
law on that set such that the expected values of the parameters satisfy exactly the 
data. We can then compute those expected values which depend on the choice of the 
reference measure. We showed also that there are strong relation between the two 
MaxEnt approaches. 

In some cases, it may happens that we have both the moment data and the sampling 
data. Then we can first use the MaxEnt approach to assign the prior law using the 
moment data and then use it with the likelihood to compute the a posteriori law of 
the parameters from which we can infer about them. 

Finally, even if I tried to answer to some of the questions, I also asked more questions 
to be answered. We thus still have a lot to do with all the three approaches. However, it 
seems that for practical applications the Bayesian approach seems to be the right and the 
easiest one. 
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